Computer Networking: A Top-Down Approach
Benchmarks
Mathematical reasoning: AIME 2024
Basic programming ability: Codeforces
Understanding of specialized knowledge: GPQA Diamond
Basic mathematical knowledge: MATH-500
Breadth of knowledge: MMLU
Debugging ability: SWE-bench Verified
AIME 2024: Let \(x,y\) and \(z\) be positive real numbers that satisfy the following system of equations: \(\log_2\left({x \over yz}\right) = {1 \over 2}\) \(\log_2\left({y \over xz}\right) = {1 \over 3}\) \(\log_2\left({z \over xy}\right) = {1 \over 4}\) Then the value of \(\left|\log_2(x^4y^3z^2)\right|\) is \(\frac{m}{n}\) where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\).
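One way to work this problem: set \(a=\log_2 x\), \(b=\log_2 y\), \(c=\log_2 z\), so the system becomes \(a-b-c=\tfrac{1}{2}\), \(b-a-c=\tfrac{1}{3}\), \(c-a-b=\tfrac{1}{4}\). Adding all three equations gives \(-(a+b+c)=\tfrac{13}{12}\); substituting \(a+b+c=-\tfrac{13}{12}\) back into each equation yields \(a=-\tfrac{7}{24}\), \(b=-\tfrac{3}{8}\), \(c=-\tfrac{5}{12}\). Then \(\left|\log_2(x^4y^3z^2)\right|=|4a+3b+2c|=\tfrac{25}{8}\), so \(m+n=25+8=33\).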
Codeforces: Capitalization is writing a word with its first letter as a capital letter. Your task is to capitalize the given word. Note that during capitalization all the letters except the first one remain unchanged.
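For a sense of this difficulty tier, here is a minimal Python solution to the capitalization task above (the function name is my own):

```python
def capitalize_word(word: str) -> str:
    """Uppercase only the first letter; leave the rest of the word unchanged.

    Note: str.capitalize() would not work here, because it also
    lowercases the remaining letters, which the problem forbids.
    """
    return word[0].upper() + word[1:]

print(capitalize_word("konjac"))  # -> Konjac
print(capitalize_word("ApPLe"))   # -> ApPLe (first letter already capital)
```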
GPQA Diamond: Two quantum states with energies E1 and E2 have lifetimes of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?
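The reasoning hinges on the energy-time uncertainty relation \(\Delta E \sim \hbar/\tau\): with \(\hbar \approx 6.6\times 10^{-16}\) eV·s, the level widths are roughly \(6.6\times 10^{-7}\) eV (for \(\tau = 10^{-9}\) s) and \(6.6\times 10^{-8}\) eV. The two levels can be clearly resolved only if their energy difference well exceeds the larger width, i.e. is much greater than \(\sim 10^{-6}\) eV (the answer options themselves are not reproduced here).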
MATH-500: Convert the point \((0,3)\) in rectangular coordinates to polar coordinates. Enter your answer in the form \((r,\theta),\) where \(r > 0\) and \(0 \le \theta < 2 \pi.\)
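Worked briefly: \(r=\sqrt{0^2+3^2}=3\), and the point lies on the positive \(y\)-axis, so \(\theta=\tfrac{\pi}{2}\); the answer is \(\left(3,\tfrac{\pi}{2}\right)\).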
MMLU: Find the degree for the given field extension \(Q(\sqrt{2}, \sqrt{3}, \sqrt{18})\) over \(Q\).
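Worked briefly: since \(\sqrt{18}=3\sqrt{2}\), the field is simply \(Q(\sqrt{2},\sqrt{3})\). Then \([Q(\sqrt{2}):Q]=2\) and \([Q(\sqrt{2},\sqrt{3}):Q(\sqrt{2})]=2\) (as \(\sqrt{3}\notin Q(\sqrt{2})\)), so the degree is \(2\times 2=4\).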
SWE-bench Verified Example
Basic mathematical reasoning, basic programming, some debugging ability, and a broad base of knowledge.
Ever since GPT-4 arrived in 2023, every winter and summer break I have been able to write and maintain some personal projects with the help of large language models.
Without large language models, I could never have set up and deployed all of this in such a short time: Docker, web server configuration files, proxies...
When you have an idea: every detail involved in turning that idea into reality.
"The gentleman is not different in nature from others; he is simply good at making use of external things." (Xunzi, "An Exhortation to Learning")
Why learn "anything" at all?
For computer science, I believe the core competencies come down to just two things: mathematics and programming. Even with a good idea, if you have no good code to implement it, others have little reason to accept it.
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
And above all, learning everything in a top-down approach.